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Abstract 


Expectation Propagation (Minka 2001) is a widely successful algorithm for variational inference. EP is an iterative 
algorithm used to approximate complicated distributions, typically to find a Gaussian approximation of posterior 
distributions. In many applications of this type, EP performs extremely well. Surprisingly, despite its widespread use, 
there are very few theoretical guarantees on Gaussian EP, and it is quite poorly understood. 

In order to analyze EP, we first introduce a variant of EP: averaged-EP (aEP), which operates on a smaller parameter 
space. We then consider aEP and EP in the limit of infinite data, where the overall contribution of each likelihood term 
is small and where posteriors are almost Gaussian. In this limit, we prove that the iterations of both aEP and EP are 
simple: they behave like iterations of Newton’s algorithm for finding the mode of a function. We use this limit behavior 
to prove that EP is asymptotically exact, and to obtain other insights into the dynamic behavior of EP: for example, 
that it may diverge under poor initialization exactly like Newton’s method. EP is a simple algorithm to state, but a 
difficult one to study. Our results should facilitate further research into the theoretical properties of this important 
method. 


Introduction 


Current practice in Bayesian statistics favors MCMC methods, but so-called variational approximations are gaining trac¬ 
tion. In machine learning, where time constraints are primary, they have long been the favored method for Bayesian 


inference (Bishop 2007). Variational methods provide fast, deterministic approximations to arbitrary distributions. Ex¬ 


amples include mean-field methods (Wainwright and Jordan 2008), INLA (Integrated Nested Laplace Approximation, 
Rue et al. 20091, and Expectation Propagation (EP). 


EP was introduced in Minka (2001) and has proved to be one of the most durably popular methods in Bayesian machine 


learning. It gives excellent results in important applications like Gaussian process classification (Kuss and Rasmussen 


2005 Nickisch and Rasmussen 2008) and is used in a wide range of applications (e.g., Jylanki et al. 2014 2011 Gehre 
and Jin| 2013 Ridgway et al. 2014 ). Recently EP has been shown to work very well in certain difficult likelihood-free 
settings (Barthelme and Chopin, 2014]), and has even been advocated as a generic form of inference in large-data problems 
(Gelman et al. 2014 Xu et al. |2014| ), since EP is easy to parallelize. 

Most of the work on EP concerns applications, and focuses on making the method work well in various settings. Why 
and when the method should work remains somewhat of a mystery, and in this article we aim to make progress in that 
direction. A few theoretical results are available when the approximating family is a discrete distribution, in which case EP 


is equivalent to Belief Propagation, a well-studied algorithm (Wainwright and Jordan 2008). The typical case in Bayesian 


inference is to use multivariate Gaussians as the approximating family, but very little is known about that case: |Ribeiro 
and Opper (20111 study a limit of EP for neural network models (the limit of infinitely many weights) and Titterington 


(20111 gives partial results on mixture models in the large-data limit. Despite these efforts, two aspects of EP’s behavior 


have remained elusive: its dynamical behavior (does the EP iteration converge on a fixed datasetl) and its large-data 
behavior (do fixed points of the iteration converge to the target distribution in the limit of infinite data?). In this work, 
we focus on the dynamical behavior of EP and show that it is asymptotically equivalent to the behavior of Newton’s 


method (Nocedal and Wright 2006). This enables us to prove that EP is exact in the large-data limit: if the posterior in 


the large-data limit tends to a Gaussian (as they usually do), then EP recovers the limiting Gaussian. Furthermore, we 
show that on multimodal distributions, EP often has one fixed-point for each mode. This also yields insights into why EP 
iterations can be so unstable. 

The outline of the paper is as follows. In section |T| we give a quick introduction to EP and introduce a simpler variant which 
we call averaged-EP (aEP). aEP is mathematically simpler than EP because it iterates over a much smaller parameter 
space (independent of n , the number of data points), which makes our results easier to state and to understand. We then 
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present our theoretical contributions in section [2j Our main result concerns the asymptotic behavior of the EP update, 
which turns out to be extremely simple. This asymptotic behavior has many consequences, of which we highlight two. 
First, EP and aEP asymptote to Newton’s algorithm. Second, EP is asymptotically exact, or more specifically the target 
distribution and one specific EP fixed-point converge in total-variation distance. In section [3] we then show that this 
Newton limit behavior of EP can give us some intuition into how the iterations of the algorithm work. Finally, in section 
[4] we discuss limitations of our results and give directions for future work. 

Notation and background 

Vectors are in bold, matrices are in bold and capitalized. Given a multivariate function /(x), we note V/ its gra¬ 
dient and Hf its Hessian, the matrix of the second derivatives. Univariate Gaussian distributions are represented 
as Af(x\fi,v) oc exp ^ (x — fi) 2 ^j, although occasionally the exponential parameters /3 = v~ l , r = /3fi are used: 

J\f(x\r,/3) oc exp (— \/3x 2 + rx'). We call /3 the precision and r the linear shift. Table |T] provides a lexicon for EP 
and a summary of the notation. 

The goal of EP is to compute a Gaussian approximation of a target distribution, which we note p (x) oc exp (—ip (x)). This 
distribution factorizes into n factor-functions (sites in EP terminology): p (x) = Y\a=i 0 ( x )- We note (pi (x) = — log (U (x)). 
EP produces a Gaussian approximation q (x) ss p (x) with the same factor structure, n Gaussian factors ft (x) such that: 
q (x) = nlLi fi ( x )- Each Gaussian factor fi (x) approximates the corresponding target factor Zj (x) . 

Newton’s algorithm as an approximate inference method 

Approximate inference methods aim to find a tractable approximation q(x) to a complicated density p(x). Most of them 
operate by solving: 


argmin D(p\ \q) 
q£Q 


where Q denotes some set of tractable distributions and D is a divergence measure. Depending on the choice of divergence 
measure and approximating distribution, one can derive various variational algorithms. These methods are often iterative 
and produce a sequence of approximations q ±,..., q? that should hopefully tend to a locally optimal approximation. 


One of our key results proves that, in the large-data limit (denoted here by n —> 00 ), EP behaves like Newton’s algorithm 
(NT, see e.g. Nocedal and Wright, 2006 for an introduction). NT aims to find a mode of a target probability distribution 
p(x) oc exp (—ip (a;)) through an iterative procedure. We present here the one-dimensional version. Once initialized at a 
point fii, a sequence of points (fit) is constructed with: 


Mt+i — fit — 


ip (Mt) 


0 (Mt) 


(1) 


This iteration can be viewed as a gradient descent with a Hessian correction. It can also be viewed as approximating 
log(p) as its second degree Taylor expansion around fi t , and then setting fit+i as the extremum of that polynomial. 

With a slight modification we can restate NT as an approximate inference algorithm iterating on Gaussian approximations 
of p, which makes the parallel to EP more obvious. Starting from an arbitrary Gaussian g 1 , with mean fi 1 , we construct 
a sequence of Gaussian approximations ( g t ) through iterating the following steps: 


1. Compute 5r t +1 = -ip (fit) and /3 t+ i = ip ’ (fi t ) 

2. Compute a Gaussian approximation to p ( x ): g t +i (x) oc exp ^<5r t +i (x — fi t ) — Pt+i < ' x ~^ 


3. Compute the mean of g t+ 1 : fH+\ = (it — 


-1 


ip (fit) ip (fit) 


With this change, the fixed point of NT is now the Gaussian distribution A f x 


iP" (x*) 


centered at x*, the mode 


of p, and with precision the Hessian of log(p) at the mode x*. Thus, the fixed point of this NT variant is the canonical 
Gaussian approximation (CGA) at the mode of p , also sometimes referred to as the “Laplace” approximation (which is 
erroneous as the Laplace approximation actually refers to approximating integrals and not probability distributions). 
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Term 

Explanation 

Target distribution 

The distribution we wish to approximate: 

P(x) cx nr=i h ( x ) 

EP approximation 

An exponential-family distribution with the same 
factor structure as p(x), 

afxlocTT” f (x) - exp (^ A ‘ s(x) ) 
q cx 1 Yi=i h V X J - Z(J2 Ai) 

“Site” or “factor” 

A factor h (x) in the target distribution 

Site approximation 

A factor /j (x) in the approximation 

Cavity prior 

The approximate distribution with site i taken out, 
i.e. q-i (x) oc exp X j s ( X )J • In aEP, 

g (x) oc (x) is independent of i. 

Hybrid distribution 

The product of a cavity prior and a true site, i.e.. 
hi (x) oc q-i (x) k (x) 


Table 1: An EP lexicon 


An important issue is the convergence of NT. It has fast convergence when initialized close to a mode of p. Technically, 
convergence is quadratic, i.e. \pt+x — x*\ < c(/it — x *) 2 . However, that is only true in a neighborhood of the mode and the 
basic version of the algorithm, which we presented here, does not generally converge for all starting points p,\. In order to 
obtain an algorithm with guaranteed convergence, one solution is to complement NT with a line-search algorithm. As we 
shall see, EP can also have unstable behavior when initialized too far from its fixed points: we return to this important 
issue in section 13.11 


Log-concave distributions and the Brascamp-Lieb theorem 


Our theoretical results depend on a very powerful theorem on log-concave probability distributions, called the Brascamp- 
Lieb theorem (Brascamp and Lieb 1976 Saumard and Wellner[ [2014 1. Let LC (x) oc exp(—-0(x)) be a log-concave 
distribution (i.e., Hip (x) is always symmetric positive definite). The variance of any statistic S (x) is then bounded 
according to: 


var iC (S(x)) < E lc ((VS) T [. HiP(x )}~ 1 VS) (2) 

We use this result in the particular case S (x) = x from which we get an upper-bound on the variance: 

var LC (x) < E lc ([Hip (x)] _1 ) (3) 

Further more, if the log-Hessian is lower-bounded (as a matrix inequality ):Hip (x) > B m , then the variance has an even 
simpler upper-bound: 

var LC (x) < B ," 1 (4) 


1 From classic EP to averaged-EP (aEP) 


Classic EP 


In this section we introduce EP in the exponential-family notation used by Seeger (2005), because it is neat, generic and 


compact. EP has been introduced from a variety of viewpoints, and the versions given in Minka (20051; Seegerl (2005); 


Bishop l 

2007); 

Raymond et al. 1 

2014) are all potentially useful. 

Following 

Minka I 

20051, given a target distribution p(x), EP aims to solve 


argmin KL(p\\q) (5) 

qGQ 


where Q is an approximating family and KL denotes the Kullback-Leibler divergence. Here we focus on the Gaussian case 
but other exponential families may be used (for example, the Gaussian-Wishart family is used in Paquet et al. 2009). 
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A central aspect of EP is that it relies on a factorization of p, i.e. that the posterior decomposes into a product of terms: 


1 

P( x ) = II li ( 6 ) 

where usually one of the terms corresponds to the prior and the rest to independent likelihood terms (here and elsewhere 
Z = f n"=i k (x) dx is a normalization constant). The decomposition is non-unique and the performance and feasibility 
of EP depend on the factorization one picks. The approximation has the same factor structure: 


?(x) oc J|/i(x) (7) 

i=1 

Following Seeger, we call the If s sites and the corresponding ffs site approximations. The site approximations have 
exponential-family form (e.g., Gaussian) 


ft (x) = exp (Aft (x)) 

which the approximation inherits 


q\ s (x) = exp { A s t (x) cf> (A s )} (8) 

where A s = J^A,. Note that A s represents the so-called natural parameters for the approximation. According to a 
well-known property of exponential families, the gradient of the partition function, V</>(A) returns the expected value of 
the sufficient statistics for a given value of the natural parameters: 

T 7 = V0(A) = -^:/t(x)e*p { A‘t(x)}dx 

Its inverse V<(> -1 transforms expected values of the sufficient statistics into natural parameters. 

A well known result for exponential families shows that the global solution of problem (|5| is a moment-matching solution: 

V* = E p {t (x)) 

In the Gaussian case, what this means is that the best approximation of p according to KL divergence is a Gaussian with 
the same mean and covariance. Of course, directly computing the mean and covariance of p is intractable, and so EP tries 
to get there by successive refinements of an approximation. 

Specifically, EP tries to improve the approximation sequentially by introducing hybrid distributions which interpolate 
between the current approximation and the true posterior. A hybrid distribution hi contains one site from the true 
posterior, but all the rest come from the approximation: 

hi (x) oc q-i (x)Zi(x), g-i(x) = JJ/j(x). (9) 

Hybrids should be tractable, meaning that one should be able to compute their moments quickly. Note that in exponential- 
family notation, the q-i (x) distribution is simply: 

<?-i(x) oc exp {(A s - Ai)i(x)} 

EP improves the approximation sequentially by (a) picking a site i (b) computing the moments of the hybrid hi and (c) 
setting A s such that the moments of q\ s match the moments of the hybrid. 

Classic EP (Alg. [l]) loops over the sites sequentially. 

A parallel variant forms all the hybrids at once, looping several times over the whole dataset (Alg. [2]). 
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Algorithm 1 Classic EP in exponential family form 
Loop until convergence 
For i in 1... n 

1. Compute “cavity” parameter A_^ A s — A, 

2. Form hybrid distribution and compute its moments 

Vi = Y- f * ( X ) li exp ( X ) _ ^ (A-i)} dx 

3. Update global parameter A s <— V0 _1 (r]i), site parameter A, •<— A s — JU A j 


Algorithm 2 Parallel EP 
Loop until convergence 

1. Process all hybrids: for i. in 1... n 

(a) Compute “cavity” parameters A_^ ■<— \ s — A* 

(b) Form hybrid distribution and compute moments 

r 1i = Y j 1 ( x ) l i ( x ) ex P {Aljt ( x ) - </> (A_i)} dx 

(c) Compute local update A, V^ -1 (rji) — Aj 

2. Update global parameters \ s v- ^ A,; 


Averaged EP 

We introduce a simpler variant of EP with a drastically reduced parameter set: namely, we get rid of all site-specific 
parameters Aj and keep only global parameters A s . The resulting algorithm is simpler to analyze. Our variant is 
straightforward, and follows from setting A; = d A s for all i, under the assumption that the contributions from all sites 
are be similar. 

Proceeding step-by-step from alg. [5] we begin with the cavity parameter, which becomes A c = A s — ^\ s = 
independent of i. We use the cavity parameter to form hybrid distributions just as before: 


hi (x) oc U (x) exp 


n — 1 


A s t (x) 


The moments of the hybrids are again noted rji , and inserting the local updates into the update for the global parameter 
we get (recall that VtjU 1 transforms moment parameters into natural parameters): 


Al = ^{Vr 1 (T7i)-A c } = ^|vr 1 (r7i)- ! ^A s J 

= V< ^ _1 fa) - ( n ~ !) 

It is interesting to examine the fixed points of this update rule, which satisfy: 


(10) 


a: = ^ V0 -1 fa | A*) — (n — 1) A* 


or equivalently: 
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Algorithm 3 averaged-EP 
Loop until convergence 

1. Compute “cavity” parameters A c •<— A s 

2. For i in 1... n, form hybrid distribution and compute moments 

Vi y J t ( x ) li ( x ) ex P { A c* ( x ) - 4 1 (A c )} dx 

3. Update global parameters A ' s (rji) — (n — 1) A s 


where the hybrid moments rji depend implicitly on A s . The following averaging rule shares the same fixed points: 

a : = -£ v * -1 (*!*-) ( 11 ) 

and that is the rule that gives averaged-EP (aEP) its nam}}] 

The resulting method is given in Alg. [3] but can be summarized in a few words. To improve an exponential-family 
approximation aEP begins by forming n hybrids of the approximation and the true posterior, it computes their moments, 
uses those to compute the new site approximations and the corresponding natural parameters, and sets the new natural 
parameters of the approximation to the sum of the site approximations. 


2 Asymptotic behavior of the EP and aEP updates 

In this section, we investigate the dynamics of the EP and aEP algorithms. We first present a new key result on the 
asymptotic behavior of the EP approximation of a site: we show that as the variance of the cavity converges to 0, the 
approximation converges to a simple Taylor approximation of log (li). This asymptotic behavior has several consequences 
but we present here the most important one: in the limit where all cavity priors q_i have small variance, the parallel EP 
and the aEP updates converge towards the updates of Newton’s algorithm. A corollary is that, for multimodal target 
distributions, all modes which have sufficient curvature have an associated EP fixed point and that, as a certain measure 
of mode peakedness goes to infinity, the EP fixed point converges to the CGA at that mode. Finally, this enables us to 
prove that EP is asymptotically exact in the large-data limit (if the CGA also is). 


2.1 Assumptions 

Throughout this section, we work in the one-dimensional case since it is the easiest to understand. All results are straight¬ 
forward to extend to the p-dimensional case (i.e., when the target distribution is p— dimensional), the most significant 
difficulty being notation. In the appendix, we give the proofs for the p-dimensional case. 

We use two assumptions on the sites U ( x ). 

Both our conditions concern the negative log-likelihood of the sites <f>i ( x ) = — log (/. t (a;)). Our first assumption is that the 
second log-derivative of the sites has a bounded range: there exists B such that: 

Vi, a; max (d> i (x)j — min ^ (x)j <B (12) 


Our second condition concerns bounding some higher log-derivatives of the sites, which ensures that all sites are sufficiently 
regular so that we can use Taylor expansions and bound the remainder terms. Our assumption is simply that there exist 
bounds K 3 and A '4 which bound the third and fourth derivatives of all (f>i functions. For d £ {3,4}: 


Vi, x 




<K d 


1 Note that it corresponds to a slowed down version of the aEP update 


(13) 
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Both of those conditions are easy to check in practice. For example, for a Generalized Linear Model, we would simply 
need to check the derivatives of the link function and that the design matrix is bounded and of full column rank. The one 


important case for which we cannot apply our result concerns non-parametric models, and, more generally, cases in which 
p is not fixed but grows. This reflects a limitation of our proof, rather than one of EP, which works just fine in such cases 
(see Appendix for details). 

An important thing to note is that we chose those two assumptions because they give very simple expressions for the 
error of the asymptotic expression, but the limit behavior we present can still be reached even if they are broken. In the 
appendix, we show how weaker assumptions (bounded li ( x) and local smoothness of fa (x)) are sufficient to obtain our 
results on the limit behavior with similar asymptotic errors. 

2.2 Limit behavior of the EP update 

2.2.1 Limit behavior of the site update 

The only complicated step in EP and aEP (especially in practical implementation) is the site-approximation update during 
which we form the hybrid distribution, compute its moments and then subtract the contribution of the cavity to obtain 
the approximation of the site. We study here the limit behavior of the site-approximation as the cavity becomes more 
and more precise. The result we obtain is essential to the rest of this work, but not entirely intuitive, so our explanation 
will be progressive and careful. 

What we are interested in is the limit behavior of the site update, as the precision of the cavity becomes large. The reason 
we focus on the high-precision limit is that, when there are many sites (datapoints), each individual one makes a small 
contribution compared to the rest. The cavity represents the contribution of all the other sites, and generally speaking 
the more sites there are the lower the variance of the cavity (the higher the precision). In large-data settings, the cavity 
prior tends to dominate the site’s likelihood, meaning that at the level of individual sites, the “large data” limit becomes 
a “weak data” limit. 

To study that limit, our first object of interest is naturally the hybrid: 



where we have parametrized the cavity precision as /3, and the cavity mean stays constant (at po throughout) [^] As /3 
grows large, the cavity prior (the Gaussian part) outweighs the likelihood, and the hybrid starts to resemble a Gaussian 
centered at po with variance /3 . Indeed those are provably the limits of the mean and variance of hi when (3 —> oo. 


When (3 is large, the hybrid is almost the same as the cavity, and it is tempting to conclude that when (3 is large no 
update happens (the cavity prior outweighs the likelihood U, the site becomes negligible). That line of reasoning, although 
tempting, is misleading, as an examination of the case of a Gaussian site shows. Suppose 



then, regardless of how large (3 is, it is straightforward to show that the site’s natural parameters are always f3i = 7 and 
r, = a. In other words: even when the prior outweighs the likelihood, the site always increases the overall precision by an 
additive factor and contributes to the overall linear shift. 


In the non-Gaussian case the site’s natural parameters also have a non-trivial limit. The exact form of that limit turns 
out to be very interesting, as it shares a close relationship to Newton’s method. Specifically, we show that r, reflects the 
gradient of the log-likelihood at the cavity mean and (i. L the Hessian. In other words, the log of the site-approximation 
tends towards the Taylor expansion around po of the log-site <pi ( x ). Fig. [l] illustrates that behavior in a simple scenario, 
where: 



(14) 


which corresponds to a logit likelihood and a cavity prior centered at 0. Here 0* = log(l + e x ), ^(0) = — and 



l 

4 ' 


2 In the notation of the previous section, the natural parameters are A = [ (3 /3/lq ]*, the precision and linear shift. 
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Figure 1: Limit of site updates under increasing cavity precision. We use the example given by eq. (141, where the site is 
a logistic likelihood, the cavity prior has mean 0 and precision p. The quantities shown are: the mean and variance of hi 
(labeled nh and varh ), and the site parameters ry and Pi. As expected, the mean tends to 0, and the variance tends to 
P~ l . More surprisingly, the site parameters have non-trivial limits that can be computed exactly (see main text). 


We can now state our result formally. For simplicity, the case we have just discussed had increasing precision and a fixed 
mean. The following theorem is stated in a slightly more general case in which the cavity mean is slightly offset from n o, 
but tends to it in as p —> +oo. This more general case is important for the corollaries we derive from this theorem. 

Theorem 1. Limit behavior of hybrid distributions 

Consider the hybrid distribution: hi(x) = h (x ) exp ^ x 2 + (Pno — Sr) x'j . In the limit that p oo, the natural param¬ 

eters of its Gaussian approximation converge to: 

var hl E hi ~ var hlha - Sr + </>( (no) 

varfl k, P + (t>i (no) 


Thus, the natural parameters of the EP approximation (Vj = var h 1 Ef li (x) — (Pno ~ Sr) and pi = var h 1 — P) ofli converge: 


Ti 


Pi 


-<Pi(ho) + Piho + O (p *■+ Sr + ^i(n o) 
4>'i (ho) + o f/3 -1 + Sr + Pi(no) 


P 


-l 


Note the important role of the Sr term: it causes the cavity-mean to be slightly different from p 0 , but it can accelerate 
convergence when set precisely to Sr = —cf> i (p, 0 ) (see appendix). 


Proof. We only give a sketch of the proof here, because it is too long and involved. 

Our proof can be understood as simply computing the asymptotic behavior of Eh t (x) and var^ (x). The first order is 
easily found to be: Eh, (x) ~ no an d var^ (x) ~ /3 -1 . However, when we compute the new values for r,; and pi, the 
subtraction of the cavity parameters effectively cancels that first order term. In our proof, we thus go beyond the first 
order and compute the next order which gives us the claimed bound. 

In practice, we use two tricks that enable us to directly express pi and as expected values under the hybrid hi, which 
saves us from actually computing the limit behavior of the mean and variance. We then approximate these expected 
values using Taylor expansions and get the claimed result. See the appendix for details. □ 









2.2.2 Limit behavior of parallel-EP and aEP. 


Now that we have some handle on the behavior of site-updates, we can start to study the behavior of the algorithm as 
a whole. A full step of aEP or parallel EP is a combination of n site updates, and that is what we characterize next. 
We show that one step of parallel-EP or of one step of aEP both converge towards the result of one step of Newton’s 
algorithm. It is also possible to use Theorem [T] to describe the limit behavior of sequential-EP or of an EP variant which 
updates batches of sites sequentially, which tend to variants of Newton’^ We choose to focus on parallel-EP because its 
limiting behavior is classic Newton’s. 

Let’s first present the limit behavior of aEP, which is easier to visualize because it only has two parameters instead of 
2n-parameters like EP. One interesting feature of the limit behavior of the site-update is that the value of the cavity 
precision (3-i does not influence the limit behavior, which is only set by the cavity mean /r_j. In the aEP algorithm, the 
cavity mean is always equal to the current approximation mean. Thus, when we sum all ri and Pi approximations, we 
find that the limit behavior of the aEP update also has that feature: the approximation at the next step mostly depends 
on the current mean of the approximation, and corresponds to a Newton’s update. 

The limit behavior of EP is similar, but is a little more complicated to state. This is due to two additional complications. 
The first complication is that, in EP, each cavity mean /i_j is slightly different. This is where the Sr parameter from 
Theorem [l] comes into play: it enables us to see each cavity distribution instead as almost centered at the same mean but 
slightly offset in a specific direction. The second complication is that, whereas in aEP each cavity distribution has the 
same precision, once again each cavity distribution is different in EP. In the end, these complications hardly matter for 
the limit behavior, but they do make it slightly harder to understand how EP works. 

Theorem 2. Limit behavior of aEP and EP 

Consider a current EP approximation , and the corresponding aEP approximation (r = Y^ r i, P = ^2 Pi) whose 

current mean is po = . In the limit that all cavity-precisions P-i = Y^j^iPj tsnd to infinity (so that min(/3_i) is of 

same order as p), the limit behavior of one step of aEP and of one step of EP is identical to Newton’s algorithm. 

For aEP, the global parameters at the next step are: 


r aEP = -Ip'(po) + PaEP^O + O ^nP 1 +^|^(/U°) P ^ 
PaEP = ip"(Po) +0 i^nP -1 + E \p'i(Po) P _1 j 


For EP, the global parameters at the next step are: 


E 


r new 

1 i 


Er 

i 


-Ip'(po) + i^2,Pi ew j Po + O ^ nP 1 + E \ ri ~ fcvo + (p'i(vo) 
ip"(p 0 ) + O [nP^ 1 + E \ ri ~ & Vo + Piivo) 



Proof. This result is simply obtained by summing the approximations offered by Theorem [T] 

For aEP, this is simple enough: all the cavity distributions are Gaussians with precision f/? and with mean pq. 
Straightforward application of theorem [l] leads to the claimed result. 


For EP, this is more complicated since every cavity distribution is different. However, it is straightforward to check that 
the cavity densities are: 


g-i ( x) oc exp 


x 2 

{P ~Pi) y + {{P ~ Pi) Mo + PiP o - n) x 


We can then apply theorem [T] with cavity precision p — pi and offset Sr = ri — Pip, o, and recover the claimed result. □ 

3 For example, sequential EP would asymptote to a variant of sequential gradient descent with a Hessian correction. See[O pper| ( 1998] for a 
more extensive discussion of the link between sequential EP and sequential gradient descent. 
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2.3 Where to find EP’s fixed points 


In this section, we use the results above to find out more about the location of fixed points of EP and aEP. We show that 
wherever the posterior distribution has a strongly peaked mode, a fixed point of EP or aEP lies in the vicinity. Our proof 
relies on an application of Brouwer’s fixed point theorem, and relies on finding stable regions of the parameter space, in a 
sense we need to make precise. 

Since Newton’s iterations are strongly contractive towards posterior modes, and since our results tell us that the iterations 
of EP and aEP are not far from those of Newton’s, there is a good chance EP and aEP do not stray too far from posterior 
modes either. Indeed, we prove that there exist compact regions of the parameter space near the CGA which are stable 
under the aEP or the EP updates: i.e., if we start from inside of them, we stay inside. We can picture these stable regions 
as boxes in parameter spaces inside of which aEP and EP get stuck. 

Unfortunately, our bounds are too weak to guarantee that the iterations of aEP and EP converge in such regions. We know 
that they cannot exit the box, but we cannot prove that they do not wander around forever inside of it. However, there 
is a much more interesting consequence of the existence of such stable regions: from the Brouwer fixed-point theorem, we 
know that any compact stable region must contain at least one fixed-point of the corresponding iteration, and so we have 
boxes in parameter spaces that contain both a fixed point of Newton’s and a fixed point of aEP/EP. In order to apply 
this insight, we would then want to find stable regions that are as small as possible in order to give the tightest bounds 
on the position of that fixed-point. 

In this section, we focus on identifying stable regions that are a close neighborhood of the CGA at the mode of the target 
distribution, and we compute the correct asymptotic scaling of the size of the stable region. These results are sufficient 
to prove that aEP and EP are both exact in the large-data limit. However, it would be an interesting extension of the 
present work to also find maximal stable regions, and to find “unstable” regions: regions of the parameter space that the 
EP iteration is guaranteed to leave and which therefore cannot hold a fixed-point. 

We find that all modes of p (x) have the potential to have an associated stable region, and that the size of that stable 
region depends on log-curvature at the mode: more peaked modes have an associated region that is smaller than flatter 
modes. We use this result in the next section to prove that aEP and EP fixed points converge to the CGA at the mode 
in the large-data limit. 

We use aEP to outline this result. Let’s assume that the starting global approximation is in close proximity to the CGA 
at a mode x * of p ( x ): 


VaEP — PaEPX*\ < nA r 


PaEP - Ip (£*) 


< nAf 


By applying Theorem [T] to a Gaussian approximation centered at x*, we find that the new value of the aEP parameters 
are such that t^ep - PaEP x * is small: 


r n a e E W P - KepX* = O In $ (:r*) - A 0 


-l 


nA r 




\<Pi( x *) 


ip (x*) - A 0 


“I -1 ' 


(15) 


If the error is smaller than nA r , the initial region would be stable in r. In order to find the limits of the stable 
simply need to find values and A r for which we can guarantee that the error is strictly smaller. Inspection 
suggests immediately that the curvature at the mode (represented by %p (x*)) plays a key role: the larger the 
the tighter the bound. 


region, we 
of eq. (151 
curvature, 


For EP, the stable regions take the form: 


n + <pi (x*) - ftx* 

A ^ <Pi i x *) 


< A r 

< A 0 


which ensures the global approximation is inside stable regions with the same form as those for aEP. 

A. r and A^ are small if the log-curvature at the mode is sufficiently high, as summarized by the following theorem: 

Theorem 3. Convergence of fixed points of EP and aEP 
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There exists an EP and an aEP fixed-point close to the CGA ofp(x) at x* if ^ (x*) is sufficiently large. More precisely, 

if: 


SaEP = max (I< 3 , K 4 ) ^ U z (x*) if) (x*) 


6 = nmax (Ii 3 , K 4 ) \ip (x*) 

are 0(1) quantities and if) (x*) is large, then the limit of the stable regions on the global approximation, nA r and nAp, 
scale as O ( S a EP + S) for aEP and as O (6) for EP. 


Proof. We only sketch the proof of this theorem which we detail in the appendix. We focus on the simpler aEP case but 
the reasoning is identical for EP. 

The key idea is the following: if we perform a first order perturbation in r a EP or in p) a EP while ff a EP is large, then this 
perturbation has a negligible effect on the limit behavior. Thus, the error still scales (almost) as if we were starting from 
the CGA at x *: /3 aE p = ip" (x*) and r aE p = PaEpX*: 


r aEP = P2 ep x * + ° [n max (K 3 ,K a ) ip (x*) 


PTep = (x*) + 0[max(K 3 ,K A ) 


ki ( x *) 


i=1 


Ip (X*) 


n -1 


from which we get the claimed limit behavior. 


□ 


Remark. It might seem strange to refer to S and S a EP as “order 1” quantities. This holds in the large-data regime as 
we show in the next section, but an easier example to visualize is the following. Consider the EP approximation of a 
fixed-probability distribution raised to power A: [p(:c)] A = H; [h ( :C )] A - For that example, as A —>• 00 , A ip (x*) —>• 00 : 
the log-derivative at the mode grows linearly. However, the third and fourth log-derivatives also grow linearly so that 

r „ 1 —1 


S = n max (A' 3 , Kf) 


if (x*) 


does not depend on A: it is indeed of order 1 . S aE p is similarly found to be of order 1 . 


This theorem shows two interesting features of the behavior of EP. First of all, we see that if p (x ) is a multimodal 
distribution where multiple modes are sufficiently peaked and they are sufficiently separated, then EP and aEP both have 
multiple fixed points, and those fixed points do not give a global account of p (x) but only fit the local shape of p (x) 
around “their” mode. This is quite contrary to the common view on EP which holds that since EP’s stated target is to 
find an approximation of the minimizer of KL ( p,q ), it gives global approximations of the target distribution. We discuss 
this point further in section [372} 


2.4 Large-data limit behavior 


So far, all of our results have been deterministic: assuming some fixed target distribution p(x), we have bounded the 
distance between the result of the aEP and EP updates and the result of the NT updates. We then used those results 
to derive a deterministic result on the possible positions of the aEP and EP fixed points. We have discussed asymptotic 
results in those sections in terms of either asymptotes of the parameter space (large precision j3) or of properties of the 
target distribution (large log-curvature at a mode ip (x*)). 


In this section, we adopt a different point of view: we seek a large-data limit result. In other words, we assume that some 
random process is generating the sites /, (x) and we consider what happens as more and more sites are generated. In real 
applications, this would correspond to accumulating data of some kind and computing the posterior of the unknown x 
under some (most likely miss-specified) generative model. We abstract all those complications away and simply treat the 
li functions (or, equivalently, the <pi functions) as function-valued random variables. 

Throughout this section, the number of sites, n, is variable. We note p n (x) the random variable of the posterior distribution 
constructed from the n first sites f (x). We note Xg the minimum of x —>■ E ( (pi (x )): xg can be thought of as the “true” 
value that we seek to recover. We note cc* the mode of p n (x) closest to Xg and q n (x) the CGA of p n (x) at x*. 


In order for EP to have good behavior, we require the process generating the f to obey our assumptions on the li (eqs. 
@ and @) and to obey two additional conditions. The first condition is that the distribution of the log-sites (pi (x) 
is non-degenerate so that a number of variances are finite and we can apply the law of large numbers (see appendix 
for details). This is a mild condition and, in the rare occasion where it does not apply, we could even weaken it. This 
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condition ensures that the log-posterior ^ fa (x) is Locally Asymptotically Normal (LAN, Kleijn et al. ( 2012| )). Under 
this LAN behavior, we prove that every global approximation in the aEP and EP stable regions around x* converge in 
KL divergence and in total-variation towards the CGA at x*: q n (x). 


Furthermore, if the process generating the l, produces a concentration of the mass of the posterior around Xo, then, 
combined with the LAN behavior of the posterior, we have enough to guarantee that the posterior p n ( x ) converges 
towards its CGA q n (x) in total-variation. This result is the last we need to prove that, in the large-data limit, aEP and 
EP are exact in the following sense: there is a large neighborhood of aEP and EP approximations that surrounds at least 
one fixed-point and where the aEP and EP iterations are “stuck”, such that all approximations in the neighborhood are 
asymptotically exact. 

The technical condition we require for concentration of mass is the following: for all e > 0, the integrals f p n ( x ) 1 (|x — Xo| < e) dx, 
which are random variables whose distribution is dictated by the distribution of the U (x), need to converge in probability 
to 1. In other words, for every e > 0, the posterior is guaranteed to concentrate inside the e-ball centered around xq■ This 
should be thought of as an identifiability condition: it requires the posterior to concentrate around the “true” parameter 
value Xq. 


Theorem 4. aEP and EP are exact in the large-data limit. 

Under our assumptions on the sites and the site-generating process, all Gaussian distributions in the stable region of th. 
[^| converge in total-variation to the CGA q n with probability 1. The convergence rate is O (n -1 / 2 ). 

Under a further identifiability assumption, all Gaussian distributions in the stable region converge in total-variation to p n 
with probability 1. The convergence rate is O (n -1,/2 ). 


Proof. Here is a sketch of the proof. First, define Iq = E (d> i (xo)j the Fisher information of our likelihood-generating 
process. 

By a law of large numbers argument, ]C " =1 (^o) ~ nlo- this quantity grows linearly. In the meantime, the gradient 

at Xq is small: ^ ) "_ 1 (p i (xo) « ar (ffl (xo)). Combining this with the third derivative bound and a simple Taylor 

expansion of (U’A-, <j) i (xo) proves that there must be a mode of p n : x*, in close proximity to Xq. The distance between 
the two scales as: x* — Xq = O (1 /y/n). 

Since the log-curvature at Xq grows linearly and Xq — x* = O (1 /\fn), the log-curvature at x* also grows linearly: 

n 

( x n ) ~ nlo 

i =1 


Similarly, y~)(' , (x*) grows linearly and n max (K 3 , K 4 ) trivially grows linearly with n. Thus, the conditions of th. J^j 

are checked: there exists a stable region near the CGA at x* which holds at least one fixed-point for aEP and EP. 


To prove the total-variation convergence, we actually prove a KL-divergence convergence. The bounds on A r and 
translate into a O (n _1 ) KL-divergence bound. From Pinsker’s inequality, this translates into a rU 1 / 2 total-variation 
bound. 


The proof then concludes by proving a n 1 / 2 convergence of the CGA towards p n which is a simple application of a 
Bernstein-von Mises theorem for miss-specified models by Kleijn et al. (2012). □ 


As a corollary from this theorem, it follows that EP is asymptotically exact in all models which respect our hypothesis on 
the li and the ^-generating process and which are identifiable. This includes an extremely large class of models since our 
conditions are fairly mild and since identifiability is a key requirement for the Bayesian method to be useful. For example, 
our result can be applied to both probit and logistic regression in finite-dimensions, as long as the feature vectors are 
bounded (in order for our hypotheses on the li to be verified), and spread uniformly enough for the Fisher information 
matrix to be strictly positive (see Appendix). Three notes must be made on that theorem. 

First of all, it shows that EP and aEP are exact in that there exists a fixed-point which converges in total-variation to 
the true posterior. It does not guarantee that asymptotically all fixed points converge. In particular, should the expected 
value of (fi (x) have several modes, we can guarantee that there also exists aEP and EP fixed points which are terrible 
asymptotic approximations of p n (x): the stable regions associated with the local minima converge to the CGA at a mode 
with negligible asymptotic contribution to the mass of p n . 

Second, it could seem from this result that EP and aEP approximations are asymptotically worse than the CGA, because 
the total-variation distance between the EP fixed-point and the target decreases slower than for the CGA. This does not 
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reflect a limitation of EP but it is a feature of our proof: we have proved that EP is good because it converges to the CGA, 
and only a direct proof would be able to prove the superiority of EP. We expect that EP should give better asymptotic 
approximations than the CGA from empirical tests of both methods, but the result we present here is too weak to prove 
this conjecture. We have recently made some progress on such a direct proof, but under more restrictive assumptions than 
the ones presented here (Dehaene and Barthelme 20151. 


Finally, it is interesting to come back to th. [2] which shows that, in the large cavity precision limit, aEP and parallel EP 
converge to NT. It is also possible to qualitatively discuss th. [2] in terms of a large-data limit instead. In order to do so, 
we assume that, as the number of data-points grows, the typical value for the cavity precisions /3_j grows linearly with n. 
This is certainly true in the stable region around Xq as we have just shown in th. [4] If we have that min (/3_i) oc n, then 
the errors in th. [2] are of order 1. These order 1 errors are negligible in the large-data limit, in exactly the same way as 
for th. Thus our result that aEP and parallel EP are almost Newton in the high precision limit qualitatively apply in 
the large-data limit as long as the “typical” cavity precisions grow linearly with n. 


3 Consequences of the quasi-Newton behavior of EP 

In the previous section, we gave a proof that, in the limit of large-data, EP behaves like a Newton search for the mode of 
the target distribution. In this section, we highlight how this result can inform our intuition about how EP behaves, and 
some potentially interesting avenues of research it opens. 


3.1 Instability of the EP iteration 


Newton’s algorithm (NT) is a good tool in finding a mode of a target distribution as it has fast convergence if it is 
initialized properly (i.e.: close enough to the mode), but it can often fail to converge globally. For example, applying NT 
to f(x) = exp(— |a;| 4 / 3 ) always results in a divergent sequence that oscillates wildly around the fixed point at x = 0. More 


generally, Newton is unstable when the log-curvature ip is small because that makes the Newton step 
In order to fix this problem, it is necessary to introduce a “slowed-down” version of the iteration: 




n -l 


ip too big. 


Ht+i = Ht - 7t 




where 0 < 7 * < 1 is chosen carefully to ensure convergence. As the NT algorithm is part of the class of Generalized Gradient 


Descent algorithms, one solution is to choose values that respect the Wolfe conditions (see Boyd and Vandenberghe 


2004 for a convergence analysis of Newton’s method). 


Since EP behaves like NT in the large-data limit, we can intuit that even for small n, EP might have a qualitatively 
similar behavior. In particular, EP iterations might oscillate around their fixed-point just like NT does. We give here a 
simple example of this behavior with sites that are extremely regular and which seem harmless at a glance. 


In our example, we applied a parallel version of the EP algorithm to the following situation: 


• five “double-logistic” sites: \/i £ {1,..., 5}, li(x) = (1 + exp(5cc )) _1 (1 + exp(—5x)) _1 , so called because they are 
the product of two logistic functions. If plotted, these appear to be Gaussian at a glance, but with the important 
difference that they only have exponential decay in their tails. In these exponential tails, the log-curvature (p i (x) is 
very small. 

• one Gaussian site representing a prior: Z 0 (x) = exp(— x 2 /2) 


In this example, there is an EP fixed-point that provides a good approximation of the target distribution. However, we 
also found that the EP iteration is unstable if it is initialized too far away from the fixed-point distribution. EP iterations 
initialized too far away converge to a limit cycle oscillating between two approximations that are completely wrong. Figure 
[ 2 ] shows the basins of attraction of the stable equilibrium and the limit cycle, and one example trajectory for each. 

Our results on the limit behavior of the EP iteration can thus inform our understanding of why the EP iteration sometimes 
has problems with convergence: it could be that the EP iteration overshoots when it is operating in a zone where most 
t p i (x) are small while most <p i (x ) are big, exactly like NT would. A possible solution to this could be to complement EP 
with an adaptive step algorithm. This algorithm would need to detect overshoots or potential overshoots, and prevent 
them. Finding such an adaptive step algorithm would represent major progress in EP methods. 
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Figure 2: Convergence of aEP on double-logistic sites. We evaluated the stability of the averaged-EP algorithm on five 
double-logistic sites and one Gaussian prior site. The aEP iteration either converged to a fixed point (red square) or to 
an oscillation (between the two red diamonds). The yellow background corresponds to the basin of attraction of the fixed 
point and the green background to that of the limit cycle. The first four steps of the aEP iteration are presented for two 
initial points, one in each basin. Note how fast the convergence to both attractors is: in four steps, the iteration reaches 
a very close neighborhood of the attractors. 

3.2 Behavior of EP on multimodal distributions 

Let’s now investigate how our results shed new light on the behavior of EP on a multimodal target distribution p ( x ). 

EP has been presented from the start as a rough approximation to the minimizer of the “forward” KL divergence: KL (p | |q) 
that uses local (i.e., site-specific) approximations of the KL divergence. This leads to the intuition that, when applied on 
a multimodal target, the EP approximation would fit all modes, or maybe most modes, since this is the behavior of the 
KL approximation. 

With our method of bounding fixed points in neighborhoods of the CGA at the various modes of p{x), we can now see 
that this intuition is flawed. Indeed, our result shows that all modes that are: 

• sufficiently peaked, so that the stable region is small 

• sufficiently isolated, so that their stable region does not overlap with that of the other modes 

have at least one associated fixed-point. This fixed-point corresponds to the EP approximation fitting only this mode and 
not the rest of the probability distribution. Thus, it can happen that EP approximations give only a partial account of 
the target distribution. 

However, it’s also false to believe that EP always gets captured and never provides a global approximation of a multimodal 
target. Indeed, when there is only one site (or more generally, when there is only a few sites), EP does give a global account 
of the distribution since EP with one site exactly recovers the KL approximation of p (a;). 

Surprisingly, both types of fixed points can co-exist. Figure [3] shows an example involving Gaussian mixtures, the proto¬ 
typical example of multimodal problems. Here the data y-\ ... y n are supposed IID, with 

p(y*l x ) = \ (N{yi',xi,l) +Af 

The parameters correspond to component means in the Gaussian mixture and are evidently interchangeable, so that the 
likelihood surface is in general bimodal. We ran EP in this example with n = 20, x = (0, —2.5) and a unit Gaussian prior 
on x. Different initializations lead to different fixed points: we found three, two corresponding to local approximations (as 
predicted by theory) and a global one, the latter far from any mode. Interestingly, the local approximations are locally 
“exact”, meaning that under the identifiability constraints X 2 > x± , the moments of the corresponding EP approximation 
are exact. The mean of the global approximation matches the exact global mean, although the covariance is a bit under¬ 
estimated. 

A simple take-home message from our work should thus be this one: do not expect EP to fit all modes of a target 
distribution, but do not automatically assume that it will fit a single mode either. 
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Figure 3: Behavior of EP on multimodal target distributions. The grey density represents the target, and the ellipses 
different EP approximations (summarised as 95% confidence regions). In this example EP has three possible fixed points, 
one corresponding to a global approximation of the target and the other two to local approximations. Which fixed point 
is reached depends on the initialisation. See text for details. 


4 Conclusion 


EP is an algorithm whose theoretical analysis lags far behind its empirical success. We describe in this manuscript a 
number of results that narrow the gap between theory and empirics, and we hope that they will provide a useful basis for 
future work. 


In this article, we propose a simpler version of EP which we call averaged-EP or aEP. aEP could be interesting as an 
empirical algorithm (see Appendix and Li et al. (2015) who introduce a close variant of aEP called stochastic EP). However, 
our main focus is on using it as a theoretical tool for studying the asymptotics of EP, since its reduced parameter space 
makes the results simpler to understand. We derive analytical results on aEP and EP in several limits: in the limit of 
large cavity precisions, and in the classical large-data limit. We prov that both methods converge to a Newton’s search 
for a mode of the target distribution. We then show that both are asymptotically exact in that there exists a fixed-point 
which converges towards the target. 

Our theoretical results open several avenues of research into gaining a better understanding of EP. First of all, while we 
shed some light into the behavior of the EP iterations by providing a qualitative link to Newton’s method, we still do 
not know how to build a variant of EP which is guaranteed to converge. This is a key avenue of research since the only 


way we know to guarantee convergence to an EP fixed-point, the Expectation-Consistent algorithm (Opper and Winther 


2005), converges much more slowly. An algorithm that always converges while staying as fast as EP would represent a 


major step forward. The parallel with NT opens the interesting idea of designing a line-search extension of EP. 


Another limit of our result is the coarseness of our bounds: while we show that EP is asymptotically exact, we do not show 
that it improves on the Canonical Gaussian Approximation when there is ample empirical evidence that it does. Future 


theoretical work on EP should aim at showing how and when EP does dominate the CGA (see Dehaene and Barthelme 
(2015) for one such investigation, though crippled by unrealistic assumptions on the model). 


A final interesting extension of this work concerns the non-parametric case, and, more generally, EP approximations of 
high-dimensional posteriors. Indeed, we believe that our results are sub-optimal in bounding how the error scales in 
high-dimensional cases, which is why we cannot apply our results to the non-parametric case for which p = n. A careful 
extension to show that EP behaves correctly in those cases would prove another huge step forward in providing a good 
theoretical basis for EP. 
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Supplementary information 

The following two sections hold all the supplementary information of this article. 

5 Proofs 

In this section, we will give detailed proofs of all the results we have presented in the main text. 
We will prove, in order: 

1. the limit behavior of the EP update in one-dimension 

2. the limit behavior of the EP update in high-dimensions 

3. the limit behavior under weaker assumptions 

4. the exactness of aEP and EP in the large-data limit 
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5.1 Assumptions 


We will prove our results in the one-dimensional case and in the n-dimensional case. Let’s first recall our assumptions on 
the likelihoods in the one-dimensional case. We will explain in section [5. 3 1 how to modify these assumptions in the high¬ 
dimensional case. In section |5.4| we show that these assumptions can be weakened considerably, though the expression 
for the errors is much harder to state. 

Let li {x) = exp (—</>* (a;)) be the sites, and (fi (x) be the negative log of each site. 

Our first assumption will be that the second log-derivatives of the sites span a finite range: 


max | 0 ,- (x)J — min \ <p i (x)J < B (16) 

and second, that some of the higher log-derivatives are bounded. There exists constants K d for d £ [3,4] such that: 

<t>i d) (x) <K d (17) 


5.2 Limit behavior of the EP update 


In this section, we will prove the following theorem. 


Theorem 5. Limit behavior of the site-approximation 

Consider the hybrid distribution: hi{x ) = h (x) exp jx 2 + (/3/r 0 — Srfx'j. In the limit that /3 — » oo, the hybrid mean 
Ehi {x) and the natural parameters of the EP approximation (ri = varf 1 Eh t (x) — (Pho — dr,;) and Pi = varf 1 — P) of U 
converge. Defining Km = max (Ks t Ki) and A r = ( (j> i (/ xq ) + Sr J, the limits are: 


E hi (x) = p hi 
n 


Pi 


ho +0(A r f3 1 + KmP 2 ) 

PihK - E hi (x)j 

~4>i{ho) + Pih o + O ( K M ft 1 + EmP 1 [hhi — ho] + Km [hhi — ho] ^ 
^fiiiho) + Piho + o {KmP 1 + P 2 ) 

Ehi (&'i (z)) 

4>i {ho) + C (KmP 1 + Km [hhi — Mo]) 
tfii {ho) + C (KmP 1 + KmAj-P 1 ) 


Remark. Note the key role of the Sr parameter. It causes the mean of the cavity distribution to be offset from /To, but it 
can make the mean of the hybrid, hhii closer to ho- Indeed, if Sr is such that A r = 0, then we gain an order of magnitude 
in the limit behavior of hhi and the errors in the limit behavior of both and pi are smaller. 

Proof. Let’s first sketch a global overview of how the proof works. Intuitively, what is going on is that we are going beyond 
the first order approximations of the mean and variance of hp. 

E h . (x) « ho 

var hi (x) « ft - 1 


and computing the next order of their limit behavior. If we try to follow that path directly, however, we obtain bounds 
that are a bit ugly and not very tight. A better proof path is slightly clever and sophisticated and consists in finding 
“tricks” ways toof bounding Pi and ri directly. This is where the Brascamp-Lieb inequality comes in play: it provides 
one-half of the pi bound. 

In practice, our proof can be decomposed into five steps: 

• Upper-bound var/,. {x) in a coarse way 

• Upper-bound and lower-bound var^. (a;) in a fine way 
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• Prove a coarse bound on the Ef lt (x) — no = nh t — no from the coarse bound on the variance 

• Use the bound on nK = Eh { (x) to improve the bound on var^. (x) to its final state which provides us with the 
bound on fa 

• Compute the limit behavior of r 2 from the coarse limit behavior of var? li (x) and jih, 


Limit behavior of varq M (x) and of fa = var/^ (x) _ — /3 First, we will deal with the variance of the hybrid. We will 
use the Brascamp-Lieb result and a Cramer-Rao like bound to derive the final bounds on fa. The Brascamp-Lieb result 
will also give a coarse bound on var^. which we will use in the other sections. 

Let’s start by upper-bounding the variance with the Brascamp-Lieb bound. This will also give us the coarse bound we 
need on var/,, (x). 


Consider the value of fa at /io- It gives us a universal lower-bound on fa (x) (from assumption 16): 

fa (x) > fa {no) - B 

Thus, when the cavity-precision j3 is sufficiently large, the hybrid distribution h t is strongly log-concave: its second 
log-derivative is lower-bounded everywhere by a strictly positive quantity: 

q2 n n 

-g ^2 log ( h * (x)) = fa 0) + /3>fa (Mo) +13- B 

For log-concave distributions, the variance of any statistic is upper-bounded by the Brascamp-Lieb inequality. In particular, 
the variance is upper-bounded by: 


var hi (x) < E hi ( (j) i (x) + (3 


I “I 


(18) 


From this and the curvature lower-bound, we get a coarse upper-bound on the variance: 

-l 


Yw hi (x) < fa (po) + /3 - B 
< O (/3 -1 ) 


(19) 


This coarse upper-bound is the first step of our proof. 

Now that this coarse bound is established, we continue working on var/j. (x) and backtrack to eq. 18 which we will simplify. 

M~ l for 


In order to simplify it, we will use several properties of the inverse function: 


M 


M G 


<t>i (Mo) + (3 -B,fa {no) +13 + B 


First, we can perform a simple Taylor expansion of the inverse function around E^ {^t> i (a;)^ + /3 in order to simplify 
r „ n-1 


</>■' (x) + p 


fa (x) +13 


1 -1 


e K [fa (x 

E hi (& {x 
Ehi (<!>i i x 


1 “I 


1 -2 


■P 


I -3 


•i {x) - E hi [fa (x) 
•i (x) - E hi (fa (x) 


( 20 ) 


This expression is a rough approximation with which it is hard to do rigorous mathematics. We would like to replace it 
with an upper-bound. Thankfully, the inverse function M —> M~ l is convex and its curvature is bounded over the interval 


of interest: M G 


fa (Mo) + P — B, fa {no) + f3 + B 


. We can then upper-bound it by a second degree Taylor expansion 


in which we replace 


E h 


— 3 r „ -1 -3 

(x)) + f3 by an upper-bound fa {no) + (3 — B . This yields: 


fa {x) +13 


< 


E hi ($i (x))+/3 
- E hi (fit (x)) + (3 (</>■' (x) - E h . (&'( (x))) 

+ \fa (+o) +/3 -b] (fa (x) - E hi (fa (x 


( 21 ) 
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We can now get an upper-bound for the variance by combining this upper-bound (21 ) and the Brascamp-Lieb inequality 
eq. (fl8|): 


var h . (x) < E h . ( 0 (x) + (3 


< 

< 

< 


E h , ( 

fa 0 )^ 

)+/?; 

E h , ( 

fa (x) / 

)+? 

Eh, ( 

fa {x)^ 

)+?. 


fa (Mo) + P- b 


i -3 


varft. 


{x 


i -3 


fa (/•*o) +P-B J Kpax h . (x) 

1 -4 


Ki 


fa (Mo) + P - B 


( 22 ) 


where we have bounded the variance of fa using a Taylor expansion, and re-used our coarse bound on the variance 


We will now invert this expression (22l.We use the convexity of the inverse function: (a + b) 1 > a 1 — a 2 b (with 
a = E h . (fa- (x)^ + 0) to find: 


var , 1 (a;) > 


i 0*0) + P 


> 


E h 


Eh, (fa (x)) + 13 


I<1 


+ Ki 

E h , 


fa (Mo) +13- B 

n 2 r 

-(3 


-4 


fa (w) + 13- B 


-4 


> 13 +E h , 


fa (a*o) + (3 + B 


(x )) - I <1 + - 

[fai (Vo) +P~B] 


-4 


> P + E h , [fa (x))-0{Kl(3- 2 ) 


(23) 


Note that in that last expression, the terms are ordered according to their asymptotic behavior: linear-term, order 1 term 
(which also contains a /3 _1 term) and a (3~ 2 remainder. 

We finally expand fa (a;) around /ift, = Ef t , (x) in order to simplify the expected value Eh, (^j> i ( x) 

E hi (fa (a;)) > fa {nh ,) - ^va x h . (x) 

> fa (/bj ~ ~y fa (Mo) +/3 - B 

Which gives us a lower-bound for 0 = var h, ( x ) — p: 

$i i^o) + (3 + B 


0 > fa (Hhi) - K4 [fa (mo) +13- B 

> & (fi hi ) - O (Kip- 1 + Ki/3- 2 ) 


- K. 


3 [fal (Mo) +P~B] 4 


(24) 


This is not our final lower-bound on 0 because it refers to the hybrid-mean /i/,. and not to the cavity-mean /i 0 . As 
explained in the proof outline, we will further along the way bound /i h, — /io- We will then be able to backtrack to our 
bounds on 0 which refer to /ift, and, with yet another Taylor expansion, make them use /io instead. 

However, before we work on /ift,, let’s conclude our work on 0. We will now upper-bound 0 which requires lower-bounding 
var ft,,. 
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We will make use of a Cramer-Rao inequality It reads: 

var- 1 (x) < E h . (j>- (x) + p^j 


< 


P + 4[ (mC + K 4 E h . ((x - hh J 2 /2j 


K a 


< P + <t>i (mC + -^-var/,. (x) 

< P + 4>i (phi) + ~ 2 ~ (Mo) + p — B 

where we have used a Taylor expansion and the coarse-bound on var^ (x) (eq. ©)■ 
This gives us an upper-bound on Pp. 

Pi < 4'i (fj'hi) + 4i (mo) + p-b 

< 4i^h % ) + 0(K 4 p~ l ) 


( 25 ) 


(26) 


Combining the upper and lower-bound (eqs. 24 and 261 , we find that we can express the limit behavior of Pi as a function 
of fJ-h, ■ 

Pi = 4i (mC + O {iup- 1 + Kip- 2 ) (27) 

This last result concludes our first section on var^ (x) and /?y 

Limit of [Ahi The next step in our proof is to work on the mean of the hybrid. We will now show that ss /Lto- This 
will give us the final expression: Pi ss (/Lto) and will also be important in deriving the limit behavior of ?y. 

We start from a “Stein relationship”: with integration by parts, we find that it would be true that, for any probability 
distribution, the expected value of the log-gradient of the distribution is always equal to 0: 


E p [4 (x)) = 0 


(29) 


where we have used p (x) <x exp (—ip (x)) as an example. 

We then apply that relationship to hi whose log-gradient has contributions from the site and from the cavity distribution: 


Ehi (4i ( x ) + X P ~ (Ph o - Sr )) = 0 

P [ntH - Mo] = Sr - E hi 

We now perform a Taylor expansion of E\ li (x)): 

Ehi { 4 i ( x )) ~ 4 (Mo) + 4 (Mo) [m/u ~ Mo] 
The error in that expression can be upper-bounded: 

Ka 


(30) 

(31) 


(32) 


error < -^-var^ (x) 


< 

< 


Ka 


2 

O ( K 3 p 


<Pi (Mo) + P — b 


-1 


(33) 


4 Consider estimating x from the observation x = x + rj where the noise has distribution hi (77 — Hhi) : it has mean 0 and variance var^. 
x is an unbiased estimator with variance var^. The Fisher information in x about x is: E^. ycf) i ( x ) + pj . This gives the claimed Cramer-Rao 
inequality. 

See example 10.22 from Saumard et al (2014). 

5 Note also the slight variant: 

pi = E h . (0" (x)) + O (K%I3- 2 ) (28) 

This variant is not used again in the rest of the work we present here but is the next order of the expansion of ft which could be of interest 
in expansions. 
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( 34 ) 


We combine that Taylor expansion (321 and that bound (33) with the Stein relationship pll) : 


P + fa(H o) [m/m - Mo] = Sr - 4>i (yu 0 ) + O (K 3 p *) 

From which we can deduce a coarse bound on Hhi — go'- 


Hhi - Ho = 


P + fa (Ho) Sr - fa (go) + O (K 3 p 2 ) 


= O 


( Sr + fa (g 0 ) 


/ 3 _1 + K 3 p 


-2 


(35) 


This coarse bound will now be used to slightly update eq. (24) in order to remove the dependency on g^. and obtain the 


final equation on the limit behavior of fa. We will also use it in the final section to compute the limit behavior of i y. 


Returning to fa We combine the coarse bound on g^ with our expression for the limit behavior of fa. We find: 

Pi ~ fa (Ho) = fa ( Hhi) - fa (Ho) + O (Krf- 1 + K 3 p~ 2 ) 

= O (K 3 \g hi - ho] + K 4/3- 1 + K 2 r 2 ) 


= O K 3 


(k 3 Sr + fa (ho) 


13 


-1 


KsP 


2 o—2 


Kifa 


-l 


Kir 2 


(36) 


Which completely concludes the proof for the expression of the limit behavior of fa. 


Limit of n = var ;i 1 (x) Hhi — (Pho — Sr) We now turn to the task of approximating r^. Our very first step here is to 
use the Stein relationship we already used above (eq. ©). This gives a simple expression for r, without going through 
the limit behavior of the hybrid mean and the hybrid variance. 


We start from the Stein relationship we had before (eq. (301): 


Eta (fa (x) + x/3 - (/3 ho - Sr)) = 0 
E hi (fa (x)) + PHhi - Pho + Sr = 0 


(37) 


Let g (x) be the Gaussian with same mean and variance as the hybrid. The natural parameters of g are the sum of the 
natural parameters of the cavity distribution and of the approximation of hi. In other words, its natural parameters are 
P + Pi and PiHo ~ S r + rn 

g = V (fa) oc exp [p + Pi] y + [ Pho - Sr + n] x'j 
If we now apply the Stein relationship to g, we find: 

E g (\P + Pi] x - (PHo ~ Sr + r,)) = 0 

(PiHhi ~ n) + (PHhi ~ PHo + Sr) = 0 (38) 


Combining these two equations (371, (381, we find: 


Ti = PiHhi - E hi 



(39) 


which is a nice simple expression for tv We have thus found a way to work directly on r 2 ; while avoiding direct work on 
the mean and variance^ 

We will now simplify eq. ( [39] ) . We start from a slight rewriting: 

Ti = PiHo + Pi (H^ - Ho) - E h . (fa (x)) (40) 

6 Note that, in the exact same way as for eq. ( |28| ), this equation is interesting in its own right for expansions of our result as it is the next 
order of the expression for r*. 
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Let’s expand the E^. ^A (a;)) term. We just perform a Taylor expansion (around /i 0 ): 

Ehi f A (A) = A (mo) + A (mo) (m/w - no)+ 0 (k 3 (vaihi + (M/u - Mo) 2 )) 
= A (mo) + 4>'i (mo) {Hhi - Mo) + O (k 3 P~ 1 + K 3 [hk ~ Mo) 2 ) 
This gives us the following expression for rf. 

n ~ Am o - A (mo) + (A - A (mo)) (Mh< - Mo) 


( 41 ) 


(42) 


We now just need to bound A ~~ A (mo) to finalize our proof, which we have already done in eq. (36). We then obtain 
the final expression for the limit behavior of rf. 


ri - A Mo - A (Mo) = C> ((A - A (Mo)) (m/u - Mo) + K 3 p 1 + Ji 3 (m^ - Mo) 2 ) 


= O 


(k 3 P 1 + K 4 /3 1 (Hhi - Mo) + K 3 (phi - Mo) 2 ) 


(43) 

□ 


5.3 Limit behavior in high-dimensions 

Our result generalizes to the p-dimensional case: when the probability distribution we are trying to approximate concerns 
a p-dimensional random variable. 

The main difficulty in stating and in understanding that case comes from the tensor notation that we have to work with. 
Indeed, all of the moments and the derivatives we have to work with shift from being scalars to being p-dimensional 
tensors of various orders. 

5.3.1 Tensor notation and rephrasing the assumptions 

Let’s first recall what tensors are. A tensor of order k is simply a multilinear mapping of (R p ) fc to R: it takes k vectors 
and returns a scalar. We will note this T [vi,V 2 ■ ■ ■ Vk\. This is a simple extension of vectors (order 1 tensors) and matrices 
(order 2 tensors). In our examples, we will always deal with symmetric tensors for which the order of the arguments does 
not influence the outputted value. Finally, an order k tensor can be also be used as a multilinear mapping with fewer 
than l < k entries: it then returns an order k — l tensor. This simply corresponds to specifying some of the inputs to the 
original tensor and leaving the assignation of the other inputs for latter. We will note by leaving to be specified inputs 
with a minus sign: eg T [v± ... Vk- 2 , —, —] for l = 2. For example, if we were to perform a Taylor expansion of V0 (x) , 
this Taylor expansion would need to return a vector (ie: an order 1 tensor). The first term of the expansion would be 

/q\ 

Hpi (x 0 ) [(x — x 0 ) , —], the second term <f>) [(x — x 0 ), (x — x 0 ) , —], etc. 

In order to state our result in p-dimensions, we will first need to extend our assumptions to the high-dimensional case. 
This is relatively easy for the boundedness condition on we just need to find a matrix B such that: 


Vxi,x 2 Hpi (xi) — H(j>i (x 2 ) < B 


(44) 


where the order relationship is the standard (semi) order between symmetric matrices: B 3 > B 2 if their difference is 
positive semi-definite. This captures the idea that the range of the second derivatives is small. 

The boundedness condition on the higher-derivatives requires us to define a norm on tensors. We will simply use the norm 
induced on tensors by the L 2 norm on vectors. This is defined, for a tensor of order k by: 


||Tj| = max 

V\...V k 


T[v 1 ... v k ] 

UWvih 


(45) 


For vectors (who are order 1 tensors), this is of course the L 2 norm. For matrices, this corresponds to the maximum 
eigenvalue. 
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This induced norm has good behavior when we input an order k tensor T % with only l inputs: the resulting order k — l 
tensor T^-i = T*, [i>i... Vi, —] has bounded norm: 

i 

l|T fc -,||<||T,|iniKII 2 (46) 

i -1 

This simply follows from the definition of the norm. We will use this fact several times because we often need to perform 
Taylor expansion of gradient vectors or of Hessian matrices. For example, if we perform an order 0 Taylor expansion of 
Hfa (x) around /xo, we can use this result to bound the matrix-valued reminder term: 

i ? 3 (x) = Hfa (x) - Hfa Oo) 

11^3 MU < ll x - Mo|| 


Once we have defined this norm, we can define the infinity norm for a tensor-valued function like the third derivative and 
the fourth derivative. We can then state our second assumption in p-dimensions: 


b {z) 

l 

a(T 


< k 3 

< A', 


(47) 

(48) 


5.3.2 Updating the proof 

Let’s now see how our proof changes now that we are in the high-dimensional case. 


Summary of the differences All the important steps of the proof are identical: the Brascamp-Lieb theorem, the 
Cramer-Rao bound, the Stein’s method trick all work in high-dimensions. Thus, the algebra of the proof is the same. 
What does change is the bounding of the error terms: throughout the proof, we repeatedly bound E — /x^) 2 ^ which 

corresponds in high-dimensions to E^. ^||5C — Hh i || 2 ^. In ID, we bounded this using the coarse bound on the variance of 
the hybrid we got from the Brascamp-Lieb theorem: 

var^. ( x) < {Hfa (p 0 ) + /? - Bfa 1 


We will do the exact same thing in high-dimensions. Let’s note Q the precision matrix of the cavity distribution. The 
coarse variance bound reads: 

Cov hi (x) < [Hfa (fj, 0 ) + Q Bp 1 


which gives: 


Ehi (jl x - M/iJI 2 ) < Tr ([Hfa + Q - B] ^ 


< P 


[Hfa (/xq) + Q B] 


-l 


which gives us the effect of the dimension parameter p on our results: it scales the error (at most) linearly. 


An example: going through the bound on Q, Now that we have given a high-level explanation of the difference 
between the two cases, let’s detail how the proof of the result needs to be extended for the bound on Qj. 

In the ID proof, we bounded: var/^ (Hfa (x)) using a Taylor expansion. In high-dimensions, this gives: 


l\Hfa (x) - Hfa (n hi )\\ < K 3 ||x- n hi \\ 

\\Hfa (x) - Hfa (n hi )\\ 2 < K% \\x - Hhif 
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We then have: 


var hi (Hfa(x)) < I<lE hi 


Vhi 


II 


< 

< 

< 


K$Tr (Cov hi (x)) 

KjTr ([Hfa ( Mo ) + Q B]' 1 ) 


PKI 


[Hfa (mo) + Q B]- 1 


where we have used the coarse bound on the covariance of the hybrid distribution that we obtained from the Brascamp-Lieb 
theorem. 


The high-dimensional theorem This then leads us to the following theorem stating our limit result in high-dimensions. 
Note that the “high precision” limit here means that all eigenvalues of Q should go to infinity. 


Theorem 6. High-dimensional extension 

Th. [5] also applies when approximating a target distribution over a p-dimensional space. The small term is then 


e = 


max (K 3 , K 4 ) Tr {[Hfa (p 0 ) + Q 


B 


-1 


< p max (K 3 , K 4 ) [Hfa (mo) + Q - B]~ 


(49) 

(50) 


Remark. In this result, the error scales linearly with the dimensionality of the space over which the target distribution is 
expressed. This means that, in the non-parametric case for which p = n, our results stop providing useful bounds, and it 
wouldn’t be possible to prove that EP is asymptotically exact. We believe that this is a weakness of our proof and not of 
EP, and that a more careful bounding of the error would show that EP still behaves properly in non-parametric problems. 


5.4 Limit behavior under weaker assumptions 

Our choice of constraints in the main text is restrictive: while it can be applied for wide classes of statistical models, it’s 
still quite far from being a necessary condition. In this section, we expose an alternative set of assumptions such that a 
variant of theorem [5] holds. 

In this section, we will simply use a global bounding of the site: 


h (z) < 1 


(51) 


We will also replace our global assumption constraining the higher log-derivatives of the sites with a purely local constraint: 
we assume that there exists a compact neighborhood I of po such that fa is derivable four times, and that those four 
derivatives are bounded inside I. 


Theorem 7. Global bounding of the site (eq. 51) and local smoothness constraints are sufficient for th. [ 5 ] to hold 


Proof. The key idea behind this proof is the following: we will show that under our assumptions the sequence of probability 
densities fa (x\(3) converges towards their restrictions to I. This convergence is strong enough to imply convergence of the 
parameters of the Gaussian approximation. Since the restriction of the site to I: f ( x ) 1 (1 £ I) fulfills the hypotheses of 
the weaker theorem, we get the claimed result. 


Convergence of fa (x\P) As a first step, let’s slightly rewrite the “cavity” Gaussian by centering it around its limit 
mean /xo : 


q-i 


ex. exp (^ — ^fX 2 + {(3po 
(x exp (x - pof 



Sr (x — 



25 











Now consider the sequence of probability density functions: 

hi (x\f$) = ex P f ( x — ho) 2 ~ Sr (x — ho)) ■ Note that these are improperly normalized. However, in the limit, 

the sequence of integrals f hi (x\/3) —> 1 as we will now show. 

Note Mp ( t ) = f exp (ty/P (x — yo)) hi (x\p) dx: this is the moment generating function of the random variable yp = 
y/P (xp — y o) (though, once again, note that this is only an asymptotically properly normalized density as our proof that 
Mp (0) —> 1 will show). For any P > 0, Mp (t) is finite (the quadratic terms in the log dominate and /, (a:) < 1). 

Let’s prove that Mp (t) converges pointwise for any value of t to exp This will prove that yp converges to a 

Gaussian of variance 1, and that its density is asymptotically properly normalized. 

Fix t. Let e > 0. Since the derivatives are bounded on X, there must exist a ball centered on y o : B (yo , r e ) C X, where 
|li (x) — l (/io)| < e. Let’s decompose the integral in Mp into the integral over B (where that simple bound holds) and the 
one over M./B. 

For the central region: 


cent (/?) = J exp (y/pt (x - y 0 )) h t (a 


> 


> 


1 ‘ 6 ■ ^ J exp ^ yfpt (x - y 0 ) - ^ (x - yo) 2 - Sr (x - y 0 )j 

t V , (t — Sr/ VP ) 2 \ 


k (ho) 


k (ho) P 


P 


U (yo) V 2tt J b 


— exp - x - y 0 - 


VP 


and a similar upper-bound. 

As p -+ oo, all of the mass of y^exp (x - y 0 - jpf') becomes concentrated inside B. Thus, the bounds converge 

to ex P (t) ^P^oo. 


For the exterior region: 


ext (P) = J exp (yfpt (x - ho)) hi (a 


< 


< 


lx£B 

VP 

h (ho) 

VP 

k (ho) 


J exp (^yfpt ( x ~ ho) ~ ^ (x - ho) 2 — Sr (x — y 0 )^j 


'x<£B 


exp 


p 


X - ho 


t \ 2 ( t-Sr/VPy 


VP 


which converges to 0 as p —> oo (exponentially fast), since all of the probability mass of v^exp (% — ho — pjp) 


is 


concentrated inside B. 

With those convergences, we could thus find /?o such that V/3 > Po, 


Mp (t) — exp 


< 3eexp ( — ) (k (yo)) 


(52) 


This proves that Mp (t) converges pointwise to exp _ 

Mp (t) is the moment generating function of an unnormalized probability distribution. The moment generating function 
for the normalized variable is simply found by taking the ratio against Mp (0). Since Mp (0), the MGF of the normalized 

density also converges to exp (V) : 

,. Mp (t) (t 2 \ 

1,ra '>-“yuoW exp (T' 


Note that this means that the sequence of random variables yp converges in MGF to a Gaussian centered at 0 and with 
variance 1. Convergence in MGF implies weak-convergence and convergence of all moments, and implies the convergence 
of Xp to a Gaussian centered at y 0 and with variance /3 _1 . However, this is a secondary point: we will have to work 
directly on Mp (t) in order to derive our next results. 
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Convergence of hi (a: |/3) to its restriction to X We will use this MGF convergence to prove that hi (cc|/3) converges 
to its restriction to X: h r t (x\fi) = hi (x\/3) 1 (x G I), and furthermore that the error we make in this approximation are 
negligible compared to the limit behavior of interest in ph t and in var^. 

Let’s first prove the convergence of the variances. h\ fulfills the hypotheses of the less-general theorem [5j so that: 

va,i h r (x) « /3" 1 + /3 _2 </>-' (mo) + O (/3 -3 ) 

The important feature is that the deviation of interest from the /3 _1 limit is of order /3 -2 . 

Let’s compute the variance for the unrestricted hybrid hi. We can decompose the variance into the contribution from X 
and the contribution from R\I. The contribution from the outer region can be upper-bounded by the following argument: 

• Let to such that exp (— |<o (x — /ro)|) (x — poj 2 is strictly decreasing on the outer region R\I. 

• Let’s recall the Markov-bound: it proves that a positive random-variable with finite mean can not have arbitrarily 
large deviations. It can be extended to prove that a positive random-variable with finite k th moment must also have 
bounded moments for all k 2 < k. 

Because exp (— |to (x — t*o)|) (x — /io) is decreasing, we can find a Markov-like bound on the contribution of the 
outer region of hi to the variance. Note p the radius of X and consider the following bound on (x — po) 1 (x ^ I): 

exp(- \t 0 (x - p 0 )\) (x - /i 0 ) 2 1 (a: $ I) < exp (- \t 0 p\) p 2 

(x - pof 1 (x i I) < exp (|t 0 (x — Mo)|) ex P (— |t 0 p|) P 2 

When we take the expected value under hi, we obtain the Markov-like bound: 

Ehi ({x - /i 0 ) 2 1 (x £ X)) < E hi (exp (\t 0 (x — Mo) I)) exp (— |top|) p 2 


Now consider adapting that last bound to the value of j3 with tp = \fj3to (note that for t > to, the corresponding 
function is still decreasing outside of X). We get: 

Ehi ((z - Mo) 2 Ux$. X )) < {Mp (~t 0 ) + Mp (t 0 )) exp (- yffitop ) p 2 

and since the Mp (±to) converge to exp ^ , we have that, for sufficiently high /3: 

E^ (i x - Mo) 2 1 (x 4 - 2: )) < ex P ex P (- \fPt 0 p 


This proves that the contribution of the outer region to the variance decreases exponentially in yf]} which is much faster 
than the error in the central region which is in /3 -2 . 

The error in the central region is thus found to dominate the other error terms so that varr 1 — fj —» var/jV 1 — /3. 

By a similar argument, we can prove that the deviation of the mean from p 0 is also dominated by the error in the central 
region, and that the error contributed by the outer region decays exponentially. This is done by finding to such that 
exp (— |to (x — Mo)|) \ x ~ Mol- This proves that: vai^Eh, (x) — (3po —> var ^}E^ (x) — (3po exponentially fast, while the 
size of the central term is of order 1. 

We thus have that the Gaussian approximations of the hi (x\/3) converges towards the Gaussian approximations of (s|/3). 
Since we can apply the weaker theorem to h\ (x\j3), we get the claimed result. □ 


5.5 Detailled proof of the stable region theorem 

In this section, we detail the proof of the main text theorem on stable regions of aEP and EP. 
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Theorem 8. Convergence of fixed points of EP and aEP 

There exists an EP and an aEP fixed point close to the CGA of p ( x) at x * if (ft (x*) is sufficiently large. More precisely, 

if: 


SaEP = max (K 3 ,K 4 ) Y] |<fo (a?*) (x*) 

5 = nmax (if 3 , K 4 ) \ip (x*) 


1 -1 


1 -1 


are order 1 quantities and ip (x*) is large, then the limit of the stable regions on the global approximation, nA r and nAp, 
scale as O ( S a EP + 6 ) for aEP and as O ( 6 ) for EP. 

Proof. Let’s start with aEP. Let’s assume that we start from region of the parameter space with the following form: 


VaEP — PaEPX* | < 7 (^ + $ aEP ) 

PaEP - Ip {■X *) < 7 (<5 + S aE p) 


The critical feature here is that the stable region has size of order 1 and that ip (x*) is large. 
We will now apply th. [5] with po = x * to bound the next parameter values: 

-]7 


1 aEP 


= PaEpX* +0 (n max (K 3 , Kf) ip (x*) 


PaEP = V’ (»*) + O ( nmax (K 3 ,K 4 ) ip ( x *) 


-1 


+0 max (^3,^4) 


In those equations, we have used the fact that O 


| r a EP - /3aEPX*\ + ^ \(pj (x*) 


i —1 


Ip" (X*) - l3 a EP - Ip" {X*) 


-1 


ip (x*) 


= O 


ip" (x*) 


-1 


(53) 


Note how 


the order 1 deviation: 


PaEP - ip ( x *) 


does not affect the limit behavior. 


Similarly, the order 1 deviation in r: \r a EP — P u ep x* |, is similarly “asymptotically silent”: \r a EP — PaEpx*\ ip (x*) 


is negligible before 1 ( x *) ip (x*) 

The limit behavior thus simplifies into: 


1 -1 


VaEP = PaEPX* + O (< 5 ) 

Pa EP = Ip ( X *) + 0(6+ SaEP ) 


When ip (x*) is sufficiently large, there exists c such that: \r a EP — PaEpx*\ < cS. If 7 > c, then this proves that the aEP 
iteration is contractive for the linear-shift natural parameter. 


Similarly, when ip (x*) is sufficiently large, P a EP — ip (x*) 
contractive in both directions which proves the result. 

The result for EP follows from the exact same reasoning. 


< C2 ((5 + S a EP ), and if 7 > max (c, C2), the aEP iteration is 

□ 


5.6 Exactness of aEP and EP in the large-data limit 

In this final section, we will prove our only non-deterministic result concerning the behavior of EP and aEP. We will prove 
that, if we accumulate sites which are generated according to some process guaranteeing Local Asymptotic Normality 
of the log-posterior, then aEP and EP have a fixed point which converges to the CGA in total-variation. Furthermore, 
if the process generating the sites respects some assumptions which constrain the mass of the posterior outside of a 
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close neighborhood of the highest-mode, then the posterior p n (x) and its CGA q n (x) converge towards one another in 
total-variation. This ensures that aEP and EP both have a fixed point which is asymptotically exact in total-variation. 
We were unable to obtain a Bernstein-von Mises results which offers exactly the conditions we needed, so we we had to 
re-derive every result from scratch. We make no claim for originality for this section which is extremely similar to many 
other asymptotic studies of likelihoods and posterior distributions. 

Let’s first detail the assumptions that we will need on the random process generating the sites l t (x). We will assume that: 


• the li are i.i.d 

• Their distribution is such that the expected value function: 

x ->• E((j>i (x)) 




has a global maximum at xq. 
their distribution is such that 


We will note Iq = E ^ (cco)^ > 0. 
the following quantities are finite: 


var 


var 

E 


(xo 

( 4>i Oo) 


< 

< 

< 


00 

00 

00 


Their size controls how regular the process generating the sites is. Note that, in the well-specified case, var 


E 


= J 0 by integration by parts, in which case /q is the Fisher information provided by the observations. 


These conditions will ensure that the posterior is Locally Asymptotically Normal so that it can be approximated locally by 
a Gaussian. Furthermore, in order to have a global approximation, we will require that the model is such that estimation 
with it is consistent: with probability tending to 1 , it converges towards xq which is the parameter that fits the data best. 
The condition is that: Ve > 0, the random variables / p n (x) 1 (\x — xo| < e) dx converge to 1 in probability as n —> oo. 
Armed with these assumptions, let’s now derive several results on the log-posterior. 

Lemma 9. As n — > oo, the log-curvature at xq grows linearly. More precisely, Ve > 0: 

n 

^2 <i>i Oo) > n{I 0 - e) 

i =1 

with probability tending to 1. Similarly, the log-gradient at x$ is of order ^Jn. With probability tending to 1, we have that: 

n 

Yl&i Oo) 

i=i 

Proof. This result is simply a large-data limit concentration result. 

The mean of the cumulative sum is uIq while its variance is ?rvar {^j> i (a:o)^- By Chebyshev’s concentration theorem, the 
probability that the cumulative sum deviates from its mean decays at speed yfn. 

By the exact same argument, we get that the cumulative log-gradient is small: its mean is 0, by definition of Xg, and its 
variance is nvar (^cf> i (:co)) • Once again, by applying Chebyshev’s theorem, we get the claimed result. □ 

Lemma 10. As n —> oo, there exists with probability tending to 1 a mode a;* in close proximity to Xq. More precisely: 

x 0 -x* n = O (™ -1/2 ) 


< fnlvar^l (a; 0 )) 


Throughout the following lemmas, we will always refer to a:* without explicitly mentionning that a;* does not always exist 
(but has probability tending to 1 of existing). 
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Proof. With lemma |9j we have that with probability 1, the log-curvature at xq grows linearly and the log-gradient is of 
order y/n. Assume for the following that the log-gradient at Xq is negative. 

We also have that the third log-derivative is bounded: 


£#( 


i= 1 


< nK :i 


Let’s then consider the following Taylor lower-bound to the log-gradient: 

n n / ra \ , %2 

£ 4>i (x) = £ <t>i (xo) + 0 - X 0 ) £ </>" (x) ) - nK 3 

*=1 i =1 Vi=l / 

If the log-curvature is sufficiently large, and the log-gradient sufficiently small, the second degree polynomial in x has two 
roots. More precisely, this happens when the discriminant is positive: 


Disc = ( (p” (x) ] - 4 


1 


£ 4>i Oo) 


i =1 


riK-t 


Given the growth rates of the log-gradient and of the log-curvature, we have that, with probability 1, this discriminant 
is positive, because it grows asymptotically as yfn. Thus, in the large-data limit, we are guaranteed to have a root in 
close vicinity of So- Furthermore, it is easy to check that the dominating term in x* is the root of the order 1 polynomial 

E?=i <t>i (xo) + (x- xo) (ELi <t>i (*)) = 0 : 


-1 


- Zo = - £ <t>i (x )) £ <t>'i ( Xo ) + O I 


-1\ 


,1=1 


K i= 1 


This leading term is of order n which concludes this prooQ 

Lemma 11. The log-curvature at a;* grows linearly with probability tending to 1. More precisely, Ve > 0: 


□ 


£ «) > n(Io - e) 


with probability tending to 1. 

Similarly, the following quantity grows linearly: 


n 

£|l«) <t>i(x o)) 


Proof. We compute the log-curvature at x* from the log-curvature at Xq with a Taylor expansion: 


£ <t>i «) - 4>i (x 0 ) 


< nK 3 \x* n - Xo\ 


Since |x* — Xo| is of order n 1 / 2 , we get that the error between the two is of order ,fn and that the log-curvature at x * 
grows linearly. 

Similarly: 


£k«) -£k(*o) 


i =1 i =1 

and we have that this quantity also grows linearly. 


< (£ \<t>l ( x o) 


,i=i 


\X n ~Xo\ 


□ 


7 This proof is straightforward to extend to the high-dimensional case. One simply needs to diagonalize the (fff "—, b, : [xyj matrix. 
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With this result, we reach a turning point in our proof. So far, we have proved that x* exits and that various quantities 
measured at a;* grow linearly (with probability tending to 1). This proves that the log-posterior is Locally Asymptotically 
Normal (LAN) around a:* (with probability tending to 1). 

We will now show that the LAN behavior of the posterior ensures that aEP and EP have a stable region in a small 
neighborhood around a;* and that all global approximations of p n (a:) inside that stable region converge towards q n (x): 
the CGA of p n (x) at x*. 

Lemma 12. We can apply main text th. 3 to x * with probability tending to 1. 

Thus, there exists a stable region of order 0: 


nA nr = O (n°) 
nA nj/3 = O (n°) 


around the CGA at x*. 


Proof. The log-curvature at the mode x* grows linearly in n and so does nmax (A 3 , K 4 ). Similarly, for aEP, 
grows linearly. Application of the theorem then gives that there exists a stable region in the large-data limit. 



Lemma 13. All Gaussian approximations ofp n (x) inside the stable region of lemma 12 converge in KL divergence to 
q n (x): the CGA centered at x*. This convergence in KL divergence implies a convergence in total-variation. 

More precisely :Vr, /3 such that: 


then: 


V~ Px* n \ 

n 

( X n) 

2—1 


nA nr 
< nA n> p 


KL (J\f (x|r, /?), q n (x)) = O (n 

drv (Af(x\r,/3) ,q n (x)) = O (n~ 1/2 ^J 


Proof. The formula for the KL divergence between two Gaussian distributions (parameterized with mean and variance) 
qi = AT (/Ui, Pf 1 ) and q 2 = J\f (/i 2 , (Tf 1 ) is the following: 


2 KL( qi ,q 2 ) 


P2 (pi - P- 2) 2 + ^ ~ l0 S (ftM) 

a I \2 , (/^2 Pi ) 2 

P2 {p 1 — P2) H- 7^52 - 


where the approximation is valid when 


(fe-ft) 

01 



8 In the high-dimensional case, the KL divergence is: 


2 KL (gi, qo) 


(Pi ~ P 2 )Q 2 (pi 


P2) + Tr (chQj - d - log 


\QA \ 

\Qi\J 


If Qi and Q2 are co-diagonal, the complicated term Tr (cpQ | 1 j — d — log ( q(| J is, like the ID case, almost quadratic. Even when they 
are not co-diagonal: 

log (IQ2 + A|) « Tr (^Qr 1 A + ^Q " 1 AQ " 1 ) 
so that the quadratic approximation is still valid: 

2 KL (gi,<j 2 ) ~ (pi - p 2 ) Q2 (pi ~ P2) + -Tr ([Q2 - Qi ] 2 Qf 2 ) 
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When we apply this expression to our distributions of interest, we find the following upper-bound for the KL divergence 
between any Gaussian approximation in the stable region and the CGA at x* (slightly abusively, we use the quadratic 
limit approximation of the /3 dependent term): 

2KL 

2KL 
2KL 


< 4i ( X n)J 

< 0 (n" 1 + n" 2 ) 

< O (nT 1 ) 


nA ri 


ELl fa'i «) - nA n,p / 2 VE”=l fa'i ( x n) - nA n,P 


nA 


n,(3 


We finally apply Pinsker’s inequality, which relates the KL divergence to the total-variation metric: KL >2 (cItvY and 
we get the claimed result. □ 

Lemma 14. Kleijn et al. ^2012) If p n (x) is Locally Asymptotically Normal and its mass concentrates near x *, then: 

drv {Pn (x ), q n {x)) = O (w 1/2 ) 


Proof. We just sketch the proof of Kleijn et al. 

Fix e > 0. Under our assumption, with probability 1, the mass of p n {x) concentrates inside the ball |cc — a^o| < e as 
n —» oo. 

By a simple Taylor expansion inside the ball around x*, we find that the CGA q n (x) is a good approximation of p{x). 
Some care must be taken to ensure the rate of convergence holds. □ 


With these last two lemmas [13] and [14] we conclude our proof of the theorem: under our conditions, all aEP and EP fixed 
points near the CGA at x* converge in total-variation (with speed O (n -1 / 2 )) to the true posterior p n (x). 


5.7 Exactness on probit and logit regression 


Let’s now show that both for probit and logit regression, EP is exact. 
In both cases, the likelihoods li (x) have the following simple form: 


k ( x ) = a (v*x) 

where v, is the vector of predictors for the i th datapoint and a is the link function (or activation function) of the model. 

Both for the probit and the logit case, a is a log-concave function with bounded derivatives of all orders. Thus, as long 
as the predictor vectors v,; are guaranteed to be bounded, then the l, are guaranteed to respect our assumptions on the 
derivatives (eqs. 16 and 17). Furthermore, the log-concavity of a makes it so that we can get rid of the identifiability 
assumption, as identifiability is implied by the other assumptions. 

Lemma 15. If the Fisher information matrix if strictly positive Iq > 0, then the posterior distribution is identifiable. 


Proof, the posterior distribution is a product of log-concave distributions. It is thus log-concave. 

Furthermore, for the posterior, the log-curvature at x* is growing linearly almost at speed Io and the log third derivative 
is bounded by 

The case with the larger tails that fits this picture is somewhat like the Huber log-likelihood (quadratic center with linear 
tails): it has a central curved region surrounded by linear tails. More precisely, the log-curvature of the posterior is larger 
than: 

41 (x) > max ^ 

2 = 1 \ 2=1 

We can integrate the limit behavior of the log-curvature in this expression yielding: 

n 

’YY,4'i ( x ) > nmax(I 0 - K 3 ||x-x*|| ,0) (55) 

2=1 


E 


i ( X n ) ^ A-3 11 x X n 11 ’ 0 


(54) 
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Critically, the log-curvature grows (approximately) linearly with n. If we simply integrate the above expressoin twice, this 
remains true for the negative log-posterior: 


Y (& (x) - fa (x*)) oc n 


(56) 


In order to conclude, we will use a brand new result: theorem III. 1 of Pereyra (2016). Pereyra shows that, for a log-concave 


probability distribution, most of the mass is concentrated in a region where the negative log-posterior is not too high above 
the minimum. Eq. [56] shows that the negative log-posterior grows linearly with the number of data-points n. By applying 
Pereyra’s theorem, we have that the probability concentrates around x* which concludes our proof. □ 


A probit or logit regression thus respects all of our hypotheses. Thus, in the large-data limit, both aEP and EP are exact 
on a probit or logit model, as long as the Fisher information matrix is strictly positive. 


6 aEP: an alternative to EP ? 


In the main text, we used aEP as a theoretical tool to study the asymptotics of EP, but could it hold practical interest 
as well? aEP is simpler than standard EP, and in particular it requires fewer matrix factorizations. In standard EP 
computing the covariance matrix of cavity distributions is a O ( nm 3 ) or O (nm 2 ) operation (where m is the number of 
parameters), depending on the problem and the implementation. In aEP the covariance of the cavity is just —^-E, the 
current covariance matrix, so that forming the cavity distribution is a very cheap O (m 2 ) operation. Depending on m 
aEP may be substantially faster. 

Another advantage of aEP is that, due to its much smaller parameter set (O (m 2 ) vs O (nm 2 )), generic numerical tools 
for fixed point iterations may be used out-of-the-box. We experimented with R package SQUAREM (Varadhan and 


Roland 2008), a set of algorithms that seek to accelerate fixed point iterations. Our limited experimentation indicates 


that although SQUAREM does not always achieve speed-ups, the fact that step sizes are chosen makes aEP very robust. 
On the other hand, since aEP is an asymptotic approximation of EP, we will incur a loss in performance. We ran 
some simulations to see how different aEP and EP are in practical examples. We picked two statistical models, Cauchy 


regression and probit regression. Probit regression is a well-know EP success story (Kuss and Rasmussen 2005 Nickisch 


and Rasmussen 2008), with well-behaved, log-concave sites. Cauchy regression features non-log concave sites and posterior 


distributions may be multimodal. 

Probit regression is the following model: 


Vi = sign(x-a + e) 
e ~ A/" (0,1) 


where x, ; is a vector of covariates, while Cauchy regression is: 


y% = x‘a + e 
e ~ Cauchy (0,1) 

in both cases inference is for the regression coefficients a. We used the standard factorization of the posterior (over 
likelihood sites) with hybrid moments computed numerically. The prior over a. was set in both cases to a ~ A? (0,1). The 
regressors x, £ R 4 were B-spline functions evaluated on a grid of n locations over the unit interval. Data were generated 
according to the model. We ran aEP and EP for 20 passes at speed 7 = 0.4, since the Cauchy likelihood induced occasional 
convergence problems. 

To measure the difference between aEP and EP, we used a relative difference in means: 

= ( 57) 
mm (c T E p,cr a EP ) 

which expresses how different the estimates are in units of standard deviations. Differences in estimated posterior variance 
was summarized by a ratio: 
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Figure 4: aEP vs EP in probit and Cauchy models. Note that the vertical scales are not the same across panels. Upper 
row: relative difference in means (eq. ©) between aEP and EP. Lower row: ratio of standard deviations between aEP 
and EP ( d a in (58)). Dots represent individual simulations, the continuous line connects the means. 


d a = max 


/ aEP a a EP 
X^aEp’ &EP 


(58) 


Both measures were averaged over the m = 4 parameters. 

The results are shown on fig. [4] aEP and EP are both exact in large n, but the differences between the Cauchy and the 
probit model are notable (one order of magnitude). In the probit model aEP and EP are practically the same with just 
20 datapoints, with relative differences in means reaching a maximum of 5%, whereas differences in the Cauchy model 
can reach 40%. With enough datapoints the differences disappear in the Cauchy model as well. 


When applying aEP, one must be careful to remember that it uses the approximation that Vi, A i ss This 

means that if a few sites are outliers and have a very exceptional contribution to the posterior, then they might make 
this approximation wrong and aEP might give a very poor approximation. If applying aEP, it thus seems sensible to 
check the validity of the assumption at the last point of the iteration as a simple sanity check to verify the quality of the 
approximation. 

One interesting possibility is to run aEP until it gets close to a fixed point to take advantage of the smaller amounts 
of computations, and then switch to the EP iteration to take advantage of the possible increased precision of the EP 
algorithm. 
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